StaticGreedy: solving the apparent scalability-accuracy 
dilemma in influence maximization 



Suqi Cheng Huawei Shen Junming Huang Guoqing Zhang Xueqi Cheng 

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{chengsuqi, shenhuawei, huangjunming, gqzhang, cxq}@ict.ac.cn 



ABSTRACT



Influence maximization, defined as the problem of finding a set of seed nodes that maximizes the spread of influence, is crucial to viral marketing on social networks. Practical viral marketing on large-scale social networks requires influence maximization algorithms with both guaranteed accuracy and high scalability. However, existing algorithms suffer from an apparent scalability-accuracy dilemma: the greedy algorithm and its improvements have guaranteed accuracy but are not scalable, while the accuracy of scalable heuristic algorithms is unstable and not guaranteed.

In this paper, we focus on resolving this scalability-accuracy dilemma. We first find that submodularity is not guaranteed in existing implementations of the greedy algorithm, because the Monte Carlo simulations conducted in different iterations of the greedy algorithm are independent of one another. Existing greedy algorithms thus require a large number of Monte Carlo simulations to alleviate the impact of unguaranteed submodularity. Motivated by this critical finding, we propose a static greedy algorithm that strictly guarantees the submodularity property by reusing the results of Monte Carlo simulations throughout the whole greedy process. As a result, the proposed algorithm achieves the same accuracy as the state-of-the-art greedy algorithms, while the number of Monte Carlo simulations needed is reduced by two orders of magnitude. Moreover, we give a dynamic update strategy that further improves the static greedy algorithm, making it comparable to the most scalable heuristic algorithm.



Categories and Subject Descriptors 

F.2.2 [Analysis of Algorithms and Problem Complexity]: Non-numerical Algorithms and Problems; D.2.8 [Software Engineering]: Metrics - complexity measures, performance measures

1. INTRODUCTION

We are witnessing the increasing prosperity of online social network sites and social media sites, where people are connected by heterogeneous social relationships. These online social networks provide convenient platforms for information dissemination and marketing, allowing ideas and behaviors to flow along social relationships in an effective word-of-mouth manner. Many companies have made efforts to popularize or promote their brands or products on online social networks by launching campaigns akin to viral marketing. The success of viral marketing is rooted in interpersonal influence, which has been empirically studied in various contexts [1] [2] [3] [4] [5].

Influence maximization, formulated as a discrete optimization problem by Kempe, Kleinberg and Tardos [6], is a fundamental problem for viral marketing. It aims to find a fixed-size set of seed nodes that maximizes the spread of influence on a social network. A seed set is evaluated by the expected number of nodes it influences, which is referred to as its influence spread. The solution of the influence maximization problem is closely tied to influence spread models, which describe the process of influence spread. Two commonly used models are the independent cascade model and the linear threshold model. With respect to these two models, Kempe et al. [6] prove that the influence maximization problem is NP-hard, and they present a greedy approximation algorithm guaranteeing that the obtained influence spread is no less than $1 - 1/e - \epsilon$ of the optimal value, where $\epsilon$ depends on the accuracy of influence spread estimation. This approximation guarantee is further extended to the general threshold model [7]. Since there is no efficient algorithm to obtain the exact influence spread of a given seed set [8] [9], the average influence spread over a sufficiently large number of Monte Carlo simulations is widely used to approximate the exact value.

However, the general greedy algorithm proposed by Kempe et al. is not scalable because it involves too many Monte Carlo simulations, which limits its application to networks of small or moderate size. To overcome this problem, many efforts have been made to improve the scalability of this general greedy algorithm [10] [11] [12] [13] [14] [15] [16]. These improvements are obtained along two directions. In the first direction, researchers attempt to reduce the number of influence spread estimations, i.e., computations of the influence spread of certain node sets. For example, a "lazy-forward" strategy [10] is proposed to effectively reduce the number of candidate nodes. Yet such reduction quickly hits its limit, because guaranteed accuracy still requires a large number of Monte Carlo simulations in every single influence spread estimation. In the other direction, various heuristics are proposed that estimate the influence spread with methods more efficient than Monte Carlo simulation. In the representative work by Chen et al. [8], the maximum influence paths between every pair of nodes are taken as delegates to estimate the influence spread of each node. However, the gain in scalability is obtained at the cost of unstable or unguaranteed accuracy. In short, existing algorithms for influence maximization face a scalability-accuracy dilemma.

This paper focuses on resolving the scalability-accuracy 
dilemma of influence maximization with respect to the in- 
dependent cascade model. We analyze the essential cause of 
the scalability-accuracy dilemma, and then propose a static 
greedy algorithm to combat it. Moreover, we further im- 
prove the scalability of the static greedy algorithm by a dy- 
namic update strategy. The contributions of this paper are 
summarized as follows: 

• We point out that the submodularity property is not guaranteed in existing implementations of the greedy algorithm, which directly leads to the inefficiency of previous greedy algorithms. This critical finding renews our knowledge about the greedy algorithm for influence maximization and opens the door to resolving the scalability-accuracy dilemma.

• We propose a static greedy algorithm that strictly guarantees the submodularity property by reusing the results of Monte Carlo simulations throughout the whole greedy process. The proposed algorithm achieves guaranteed accuracy with a dramatically reduced number of Monte Carlo simulations, and thus effectively improves the scalability of the greedy algorithm.

• We give a strategy to further speed up our static greedy algorithm by dynamically updating the marginal gain of the candidate nodes. This strategy, which takes advantage of the static results of Monte Carlo simulations, enables our static greedy algorithm to run 2-7 times faster than the CELF-optimized static greedy algorithm on the social networks tested in this paper, and makes it comparable to the most scalable heuristic algorithm.

2. RELATED WORK 

Influence maximization was first studied by Domingos and Richardson from the algorithmic perspective [1] [2], and was then formulated as a discrete optimization problem by Kempe et al. [6]. They also propose a greedy algorithm whose accuracy guarantee follows from the monotonicity and submodularity of the objective function of the influence maximization problem. However, this greedy algorithm is inefficient and does not scale to large social networks.

Thus, several studies are devoted to optimizing Kempe's greedy algorithm without affecting its guaranteed accuracy. Leskovec et al. [10] propose the "cost-effective lazy forward" strategy, namely CELF, for selecting new seeds by further exploiting the submodularity property of influence maximization. The CELF strategy can greatly reduce the number of evaluations of the influence spread of nodes. This strategy is further improved into the CELF++ strategy [16], which simultaneously calculates the influence spread used in successive iterations of the greedy algorithm. The NewGreedy algorithm [11] reuses the results of Monte Carlo simulations to estimate the influence spread of all candidate nodes in the same iteration. It has been further developed into the MixedGreedy algorithm, which integrates the advantages of the CELF strategy and the NewGreedy algorithm.

Unfortunately, those improved greedy algorithms are still inefficient because they involve too many Monte Carlo simulations for influence spread estimation. Hence, several heuristic algorithms for the independent cascade model are proposed to improve the scalability of the greedy algorithm by simplifying influence spread estimation. Chen et al. [11] suggest a degree discount heuristic for influence maximization on the uniform independent cascade model. Wang et al. [13] divide a network into communities and conduct Monte Carlo simulations within each community instead of the whole network. Luo et al. [17] propose to conduct the greedy algorithm on a small set of nodes consisting of the top nodes ranked by the PageRank algorithm on the social network. Kimura and Saito [12] propose shortest-path based influence cascade models and provide efficient algorithms to compute the influence spread under these models. Instead of using the simple shortest path, the PMIA algorithm [8] [18] uses maximum influence paths for influence spread estimation, and this algorithm is believed to be the best heuristic algorithm so far. However, these heuristics give up the accuracy guarantee of the greedy algorithm, which raises concerns about their reliability.

In addition, several influence maximization algorithms go beyond the framework of the greedy algorithm. Jiang et al. [15] suggest a simulated annealing approach with several heuristics to speed up the computation of the influence spread. Narayanam and Narahari [14] give a way to improve the scalability of influence maximization using the concept of Shapley value borrowed from cooperative game theory. Mathioudakis et al. [19] suggest removing edges that are unimportant for influence propagation to accelerate influence computation algorithms.

3. STATIC GREEDY ALGORITHM 
3.1 Influence maximization problem 

We consider the influence maximization problem with respect to the independent cascade (IC) model. For a directed graph $G = (V, E)$, each edge $(u, v) \in E$ is associated with a propagation probability $p(u, v)$, denoting the probability that $v$ is activated by $u$ through the edge $(u, v)$ after $u$ is activated. In the IC model, whether $u$ activates $v$ is fully determined by $p(u, v)$ and is independent of the probabilities associated with other edges. A node cannot become inactive once it becomes active. Each node has only one chance to activate each of its inactive neighbors. Given a seed set $S$, its influence spread $I(S)$ is defined as the expected number of nodes eventually activated. The influence maximization problem aims at finding the set $S$ that maximizes $I(S)$, under the constraint that the size of $S$ is no larger than a predefined positive integer $k$.

To solve the influence maximization problem, a method is needed to evaluate $I(S)$ for a given $S$. However, it is intractable to compute $I(S)$ exactly on a typically sized graph. In practice, Monte Carlo simulation is employed to estimate $I(S)$, and it can be implemented in two different ways:

• Simulation. The influence spread is obtained by directly simulating the random process of influence cascade from a seed set. Given a set $S$, the simulation is conducted as follows. Let $A_i$ denote the set of nodes activated in the $i$-th iteration, so that $A_0 = S$. Then, at the $(i+1)$-th iteration, each node $u \in A_i$ attempts to activate its inactive neighbors and successfully activates a neighbor $v$ with the probability $p(u, v)$ associated with the edge $(u, v)$. This process is repeated until no nodes are newly activated. For each random cascade, the number of eventually activated nodes is the influence spread of this single simulation. Finally, by running this random cascade process many times, we estimate the influence spread $I(S)$ as the average over all these simulations.

• Snapshot. According to the characteristics of the IC model, snapshots of the influence propagation graph $G$ can be obtained a priori. A snapshot is a graph $G'$ in which each edge $(u, v)$ of $G$ is retained with probability $p(u, v)$. Note that each snapshot is an instance of the probability space comprising the graphs obtained by sampling the influence propagation graph $G$. For each snapshot $G'$, the influence spread of $S$ is the number of nodes reachable from $S$. Then, $I(S)$ can be estimated by averaging over many snapshots.

The results of the two methods are essentially equivalent, and each has its own advantage. For estimating the influence spread of a single given node set, the simulation method is faster, because it only needs to explore a small portion of the edges, while the snapshot method has to check all the edges in the graph. If we need to estimate the influence spread of many sets, the snapshot method outperforms the simulation method in terms of time complexity, since the same snapshots can be reused to calculate the influence spread for all candidates.
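The two estimation methods can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the graph representation (a dict mapping each node to a list of `(neighbor, probability)` pairs) and all function names are assumptions made for the example.

```python
import random
from collections import deque

def simulate_cascade(graph, seeds, rng):
    """Run one random IC cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v, p in graph.get(u, []):
            # u gets exactly one chance to activate each inactive neighbor v
            if v not in active and rng.random() < p:
                active.add(v)
                frontier.append(v)
    return len(active)

def sample_snapshot(graph, rng):
    """Keep each edge (u, v) independently with probability p(u, v)."""
    return {u: [v for v, p in nbrs if rng.random() < p]
            for u, nbrs in graph.items()}

def reachable(snapshot, seeds):
    """Set of nodes reachable from the seed set in one snapshot (BFS)."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        u = frontier.popleft()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

def spread_by_simulation(graph, seeds, runs, rng):
    """Simulation method: average cascade size over independent runs."""
    return sum(simulate_cascade(graph, seeds, rng) for _ in range(runs)) / runs

def spread_by_snapshots(snapshots, seeds):
    """Snapshot method: average reachable-set size over pre-sampled snapshots."""
    return sum(len(reachable(g, seeds)) for g in snapshots) / len(snapshots)
```

Note that `spread_by_snapshots` takes the snapshots as an argument, so the same list can be passed in for every candidate seed set, which is the reuse property discussed above.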

3.2 Submodularity in greedy algorithms 

For greedy algorithms of influence maximization, to guarantee the approximation within a factor of $1 - 1/e$, it is required that the influence spread function $I(\cdot)$ is monotone and submodular, and that its value can be evaluated exactly [20]. We say that a function $f(\cdot)$ is monotone if $f(S \cup \{v\}) \geq f(S)$ for any set $S$ and any element $v \notin S$, and $f(\cdot)$ is submodular iff $f(S \cup \{v\}) - f(S) \geq f(T \cup \{v\}) - f(T)$ when $S \subseteq T$. The submodularity property is also explained as a natural "diminishing return" property. Although $I(\cdot)$ is proven to be monotone and submodular [6], the value of $I(\cdot)$ cannot be calculated exactly. This is precisely the difficulty faced when implementing the greedy algorithm for influence maximization. In practice, Monte Carlo simulation is employed to approximately estimate $I(\cdot)$.

Figure 1: Illustrations of unguaranteed submodularity property.
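The two definitions can be checked by brute force on a small ground set. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the check is exponential in the size of the ground set, and the toy coverage function stands in for the reachable-node count on a single fixed snapshot.

```python
from itertools import combinations

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone_submodular(f, ground):
    """Brute-force check of monotonicity and submodularity of f on ground."""
    ground = frozenset(ground)
    all_sets = list(subsets(ground))
    for S in all_sets:
        for T in all_sets:
            if not S <= T:
                continue
            for v in ground - T:
                gain_S = f(S | {v}) - f(S)
                gain_T = f(T | {v}) - f(T)
                # gain_S < 0 violates monotonicity,
                # gain_S < gain_T violates diminishing returns
                if gain_S < 0 or gain_S < gain_T:
                    return False
    return True

# Toy example (assumed data): reach[v] is the set of nodes reachable from v
# in one fixed snapshot, and cover(S) counts the nodes reachable from S.
reach = {1: {1, 2}, 2: {2, 3}, 3: {3}}

def cover(S):
    return len(set().union(*(reach[v] for v in S)))
```

Here `is_monotone_submodular(cover, reach)` holds, whereas a function such as `lambda S: len(S) ** 2` fails the diminishing-return test.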

Unfortunately, things become different when Monte Carlo simulation is employed. In existing implementations of greedy algorithms, Monte Carlo simulations are conducted independently in different iterations. The spread along an edge $(u, v)$ may fail in one Monte Carlo simulation and succeed in another. As a result, the marginal gain $I(S_i \cup \{v\}) - I(S_i)$ from adding $v$ to the seed set $S_i$ in the $i$-th iteration might be lower than the marginal gain $I(S_{i+1} \cup \{v\}) - I(S_{i+1})$ from adding $v$ in the $(i+1)$-th iteration. Therefore, once Monte Carlo simulation is used, the submodularity property of the estimated influence spread function $I(\cdot)$ is no longer guaranteed. Moreover, the monotonicity property suffers from the same problem.

Figure 1 gives a simple example to illustrate the unguaranteed submodularity caused by Monte Carlo simulation. Figure 1(a) depicts a graph in which each edge is associated with a certain propagation probability (e.g., 0.5). Figure 1(b) represents the result of one Monte Carlo simulation by retaining only the edges along which the spread succeeds. Figure 1(c) represents the result of another Monte Carlo simulation. Suppose Figure 1(b) is used in the first iteration of a greedy algorithm and Figure 1(c) is used in the second iteration. Starting from an empty seed set $S_0 = \emptyset$, the greedy algorithm selects node $v_2$ as the seed node in the first iteration since $v_2$ has the largest influence spread. In the second iteration, we use the result of the Monte Carlo simulation in Figure 1(c), and the marginal gain of adding $v_4$ is $I(S_2 \cup \{v_4\}) - I(S_2) = I(\{v_2, v_4\}) - I(\{v_2\}) = 2$. However, according to the result of the Monte Carlo simulation in Figure 1(b), the marginal gain of $v_4$, $I(S_1 \cup \{v_4\}) - I(S_1) = I(\{v_2, v_4\}) - I(\{v_2\})$, is smaller. That is to say, the marginal gain of $v_4$ in the first iteration is smaller than in the second iteration. This contradicts the submodularity property.
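The effect can be reproduced on a toy instance. The sketch below does not use the graph of Figure 1; the two snapshots, the node names and the helper functions are hypothetical and chosen only to show how marginal gains measured on different snapshots can increase from one iteration to the next.

```python
def reach_count(snapshot, seeds):
    """Number of nodes reachable from the seeds in one snapshot (DFS)."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

def marginal_gain(snapshot, seeds, v):
    """Gain in reachable nodes from adding v to seeds, on a given snapshot."""
    return reach_count(snapshot, set(seeds) | {v}) - reach_count(snapshot, seeds)

# Two independent snapshots of the same hypothetical 6-node graph.
snapshot_iter1 = {"v2": ["v1", "v3"]}   # every edge out of v4 failed
snapshot_iter2 = {"v4": ["v5", "v6"]}   # both edges out of v4 succeeded

# Iteration 1 measures v4 against the empty seed set on the first snapshot.
gain_first = marginal_gain(snapshot_iter1, set(), "v4")
# Iteration 2 measures v4 against the larger seed set {v2} on the second snapshot.
gain_second = marginal_gain(snapshot_iter2, {"v2"}, "v4")

# The gain grew although the seed set grew: diminishing returns is violated.
violated = gain_first < gain_second

# On one fixed snapshot the gain of v4 can never increase as the seed set grows.
fixed_ok = (marginal_gain(snapshot_iter2, set(), "v4")
            >= marginal_gain(snapshot_iter2, {"v2"}, "v4"))
```

In this toy instance the gain of `v4` is 1 on the first snapshot and 3 on the second, so `violated` is true, while `fixed_ok` shows that evaluating both iterations on the same snapshot avoids the problem.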

To mitigate the unguaranteed submodularity caused by the independence between Monte Carlo simulations conducted in different iterations, existing greedy algorithms have to use an extremely large number $R$ (typically $R$ = 10,000 or $R$ = 20,000) of Monte Carlo simulations in each iteration. These algorithms try to preserve the submodularity and monotonicity properties by closely approximating the exact influence spread. However, because the number of Monte Carlo simulations is finite, these properties can only be guaranteed with a certain probability. This raises concerns about the accuracy of existing greedy algorithms. In addition, to improve the accuracy, existing greedy algorithms have to use a huge number of Monte Carlo simulations, which results in expensive computational cost and limits the application of these greedy algorithms to real-world social networks with millions of nodes. Therefore, these algorithms suffer from a scalability-accuracy dilemma, and this dilemma has its roots in the independence between Monte Carlo simulations conducted in different iterations of these greedy algorithms.

3.3 Description of static greedy algorithm 

We have pointed out that the root of the scalability-accuracy dilemma is the independence between Monte Carlo simulations conducted in different iterations of greedy algorithms. Along this line, we propose a new greedy algorithm that shares the results of Monte Carlo simulations across all iterations of the greedy algorithm. We call our algorithm a static greedy algorithm, namely StaticGreedy, because the results of Monte Carlo simulations can be generated a priori in the snapshot manner described in Section 3.1 and are kept static throughout the whole greedy algorithm.

Given an underlying social network $G$ and a positive integer $k$, the StaticGreedy algorithm seeks a seed set $S$ that maximizes the influence spread $I(S)$ according to the following process:

1. Static snapshots: Randomly sample $R$ snapshots from the underlying social network $G$. In each snapshot, each edge $(u, v)$ is sampled with its associated probability $p(u, v)$;

2. Greedy selection: Start from an empty seed set $S$, then iteratively add one node at a time into $S$ such that the node provides the largest marginal gain of $I(S)$, which is estimated on the $R$ snapshots. The process continues until $k$ nodes have been selected.

The StaticGreedy algorithm is formally described in Algorithm 1. There are two main differences between this algorithm and existing greedy algorithms: (1) Monte Carlo simulations are conducted in the static snapshot manner, with the snapshots sampled before the greedy process of selecting seed nodes, as shown in lines 2 to 4; (2) the same set of snapshots is reused in every iteration to estimate the influence spread $I(S)$, which explains the meaning of "static".
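The two-step process described above (sample the snapshots once, then select greedily on them) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the edge-list format `(u, v, p)`, the function names and the `seed` parameter are assumptions, no CELF optimization is applied, and $k$ is assumed not to exceed the number of nodes.

```python
import random

def sample_snapshot(edges, rng):
    """edges: list of (u, v, p). Returns an adjacency dict of retained edges."""
    adj = {}
    for u, v, p in edges:
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
    return adj

def reach_count(adj, seeds):
    """Number of nodes reachable from the seeds in one snapshot."""
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen)

def static_greedy(nodes, edges, k, R, seed=0):
    rng = random.Random(seed)
    # Step 1: sample R snapshots once; they stay fixed for the whole run.
    snapshots = [sample_snapshot(edges, rng) for _ in range(R)]
    S = []
    # Step 2: greedy selection, always evaluated on the same snapshots.
    for _ in range(k):
        best, best_total = None, -1
        for v in nodes:
            if v in S:
                continue
            # Summed reach of S + {v}; dividing by R gives the spread estimate.
            total = sum(reach_count(g, S + [v]) for g in snapshots)
            if total > best_total:
                best, best_total = v, total
        S.append(best)
    return S
```

For example, on a star graph whose hub reaches every leaf with probability 1, `static_greedy` selects the hub first regardless of $R$.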

The StaticGreedy algorithm has two benefits: 1) the accuracy is guaranteed, since the submodularity and monotonicity properties of the influence spread function are strictly guaranteed; 2) the StaticGreedy algorithm is highly scalable. In StaticGreedy, the influence spread does not need to be accurately approximated by a large number $R$ of Monte Carlo simulations. Thus, $R$ can be reduced significantly. The only requirement on $R$ is that the $R$ snapshots provide a sufficient and representative delegate of the underlying social network. Roughly speaking, this only requires that each edge can be observed in the $R$ snapshots at least once. Thus, $R$ is in the magnitude of $O(1/p_{min})$, where $p_{min}$ is the smallest propagation probability on all edges. For a typical social network with $p = 0.01$ or so, hundreds of snapshots could be enough to represent the social network. This value is much less than what the existing algorithms require, i.e., typically in the magnitude of 10,000.

Algorithm 1 StaticGreedy($G$, $k$, $R$)
 1: initialize $S = \emptyset$
 2: for $i = 1$ to $R$ do
 3:   generate $G'_i$ by removing each edge $(u, v)$ from $G$ with probability $1 - p(u, v)$
 4: end for
 5: for $i = 1$ to $k$ do
 6:   set $s_v = 0$ for all $v \in V \setminus S$
 7:   for $j = 1$ to $R$ do
 8:     for all $v \in V \setminus S$ do
 9:       $s_v$ += $|R(G'_j, S \cup \{v\})|$
10:     end for
11:   end for
12:   $S = S \cup \{\arg\max_{v \in V \setminus S} \{s_v / R\}\}$
13: end for
14: output $S$

Figure 2: The relationship between $d_{R,k}$ and $R$ of CELFGreedy and StaticGreedy on NetHEPT network.
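The rough requirement that each edge be observed at least once in the $R$ snapshots can be made concrete with a small calculation. The sketch below is an illustration added here, not an analysis from the paper: it assumes snapshots are sampled independently, so an edge with propagation probability $p$ is missing from all $R$ snapshots with probability $(1 - p)^R$; the function name is hypothetical.

```python
def prob_edge_observed(p, R):
    """Probability that an edge with propagation probability p is retained
    in at least one of R independently sampled snapshots."""
    return 1.0 - (1.0 - p) ** R

# Coverage of a single edge with p = 0.01 for a few snapshot counts.
coverage = {R: prob_edge_observed(0.01, R) for R in (100, 300, 500)}
```

Under this assumption, an edge with $p = 0.01$ appears in at least one snapshot with probability of about 0.63 for $R = 100$ and about 0.95 for $R = 300$, which is consistent with the statement that hundreds of snapshots are of the right magnitude.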

3.4 Analysis of the StaticGreedy algorithm 

3.4.1 Accuracy 

To clarify the performance of the StaticGreedy algorithm compared with the original greedy algorithm, we illustrate the accuracy of these algorithms with respect to the number $R$ of Monte Carlo simulations on a benchmark network, NetHEPT. This network consists of tens of thousands of physics researchers and their co-authorship relations. The employed baseline greedy algorithm is CELFGreedy, which is the general greedy algorithm with CELF optimization. In addition, we choose two commonly used IC models: the uniform independent cascade (UIC) model with $p = 0.01$ and the weighted independent cascade (WIC) model introduced in Ref. [6], with the setting $p(u, v) = 1/k_v$, where $k_v$ is the indegree of node $v$.

Since the optimal influence spread is unknown to us, the ground truth we use here for each value of the set size $k$ is the influence spread of the solution obtained by the CELFGreedy algorithm with its typical setting, i.e., $R$ = 20,000. To evaluate the relative difference between the influence spread obtained by a greedy algorithm and the ground truth, we use a measure $d_{R,k}$ computed from the seed set $S_{R,k}$ obtained by a greedy algorithm with a given $R$, where $k$ indicates the size of the seed set. For a given $R$, we run both the StaticGreedy algorithm and the CELFGreedy algorithm 50 times to calculate the average relative difference. We only report the results with $k = 50$ since the results for other $k$ are similar.

Figure 3: Minimal number of snapshots needed to accurately find a solution.

As shown in Figure 2, as $R$ increases, the StaticGreedy algorithm quickly approaches the ground truth while the CELFGreedy algorithm converges slowly. For the same $R$, the StaticGreedy algorithm is consistently more accurate than the CELFGreedy algorithm. This confirms that the StaticGreedy algorithm can achieve good accuracy even when the number $R$ of Monte Carlo simulations is small, e.g., $R = 100$.

We further evaluate the accuracy of the StaticGreedy algorithm with respect to the size $k$ of the seed set. For this purpose, we define $R_{min}$ as the minimal $R$ satisfying $d_{R,k} < 0.005$ and depict $R_{min}$ with respect to $k$. As shown in Figure 3, for all the tested values of $k$, the StaticGreedy algorithm always has a small $R_{min}$, much smaller than the value of $R_{min}$ for the CELFGreedy algorithm.

3.4.2 Scalability 

Now we analyze the time complexity of the StaticGreedy algorithm. For clarity, we use $R$ to denote the number of Monte Carlo simulations required by existing greedy algorithms and $R'$ to denote the number of Monte Carlo simulations required by our StaticGreedy algorithm. In addition, $n$ is the number of nodes in the underlying influence network, $m$ is the number of edges in the network, $m'$ is the average number of active edges in the snapshots obtained by sampling the influence network, and $k$ is the number of seed nodes. For the StaticGreedy algorithm, the time complexity includes two parts: first, generating $R'$ snapshots takes $O(R'm)$ time; second, it takes $O(knR'm')$ time to select seed nodes in the greedy manner on those static snapshots. Thus, the total time complexity is $O(R'm + knR'm')$. The space complexity of the StaticGreedy algorithm is $O(R'm')$, which is used to store the $R'$ snapshots. The comparison with the general greedy algorithm [6] is given in Table 1.

Table 1: Time and space complexity of algorithms

Algorithms      Time complexity        Space complexity
GeneralGreedy   $O(knRm)$              $O(m)$
StaticGreedy    $O(R'm + knR'm')$      $O(R'm')$
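To get a feel for the two time-complexity expressions, one can plug in rough numbers. The sketch below is only a back-of-the-envelope illustration: it drops all constant factors, uses NetHEPT-like sizes (about 15K nodes and 59K edges), and assumes that under the UIC model with $p = 0.01$ the average number of active edges is $m' \approx 0.01 m$. None of these figures is a measured running time.

```python
def general_greedy_cost(k, n, R, m):
    """Operation count suggested by O(k n R m), constants ignored."""
    return k * n * R * m

def static_greedy_cost(k, n, R_prime, m, m_prime):
    """Operation count suggested by O(R'm + k n R'm'), constants ignored."""
    return R_prime * m + k * n * R_prime * m_prime

k, n, m = 50, 15_000, 59_000      # assumed NetHEPT-like sizes
R, R_prime = 20_000, 100          # snapshot counts used for the two algorithms
m_prime = 0.01 * m                # assumed expected live edges under UIC, p = 0.01

ratio = (general_greedy_cost(k, n, R, m)
         / static_greedy_cost(k, n, R_prime, m, m_prime))
```

Under these assumptions the ratio is on the order of $10^4$, with the reduction coming from both the smaller number of snapshots and the sparsity of each snapshot.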

The time complexity of the StaticGreedy algorithm can be 
further reduced by employing the CELF optimization and 
other optimization strategies. In the next section, we give 
a dynamic update strategy to improve the efficiency of the 
StaticGreedy algorithm. 

3.4.3 Discussions 

In Ref. [11], the authors propose to reuse snapshots within the same iteration. Their motivation is to reduce the computational cost by simultaneously estimating the influence spread of many seed sets. However, reusing snapshots only within the same iteration cannot guarantee the submodularity and monotonicity properties as the StaticGreedy algorithm does. Thus a large number of snapshots is still required in each iteration of the greedy algorithm.

In addition, how do we determine the minimum $R$ for a specific network and a given spread model? What are the factors affecting $R$ in StaticGreedy or previous greedy algorithms? We leave these interesting questions as open problems for future work.

4. SPEEDING UP THE STATICGREEDY 

In this section, we propose a dynamic update strategy to further speed up the static greedy algorithm. This strategy exploits the advantage of static snapshots and calculates the marginal gain in an efficient incremental manner. Specifically, when a node $v^*$ is selected as a seed node, we directly discount the marginal gain of the other nodes by the marginal gain they share with $v^*$.

For a snapshot $G'_i$, we use $R(G'_i, v)$ to denote the set of nodes reachable from $v$ and $U(G'_i, v)$ to denote the set of nodes from which $v$ can be reached. In the first iteration, the marginal gain of $v$ on $G'_i$ is $|R(G'_i, v)|$. In our dynamic update strategy, when $v^*$ is selected as a seed node, we find the set $U(G'_i, w)$ for each node $w \in R(G'_i, v^*)$. Then, for every $u \in U(G'_i, w)$, we delete $w$ from $R(G'_i, u)$. Now, the size of $R(G'_i, u)$ reflects the marginal gain of $u$ in the next iteration. In this way, we maintain a dynamically updated marginal gain for each node and avoid calculating the marginal gain from scratch. The detailed implementation of the static greedy algorithm with this strategy, called StaticGreedyDU, is given in Algorithm 2.

Now we analyze the time and space complexity of the StaticGreedyDU algorithm. For undirected graphs, $R(G'_i, v)$ is the same as $U(G'_i, v)$. We only need to store the information of connected components for each snapshot. Thus, the space complexity is $O(R'n)$. It takes $O(R'm)$ time to generate $R'$ snapshots and calculate the initial marginal gain, and $O(kn)$ time to update information for all the related nodes. Then, the total time complexity is $O(R'm + kn)$. For directed graphs, let $n_r = \max_{v \in V} |R(G'_i, v)|$ and $n_u = \max_{v \in V} |U(G'_i, v)|$. Since $R(G'_i, v)$ and $U(G'_i, v)$ need to be stored for each node, the space complexity is $O(R'nn_r + R'nn_u)$. Assume the maximum running times to compute $R(G'_i, v)$ and $U(G'_i, v)$ are $t_r$ and $t_u$ respectively. It takes $O(R'm)$ time to generate snapshots, $O(R'nt_r + R'nt_u)$ time to compute the initial incremental influence spread, and $O(kR'n_rn_u)$ time to update information. Hence, the total time complexity is $O(R'm + R'nt_r + R'nt_u + kR'n_rn_u)$. Note that $n_r$, $n_u$, $t_r$ and $t_u$ are usually very small in real-world networks, which are sparse.



Algorithm 2 StaticGreedyDU($G$, $k$, $R$)
 1: initialize $S = \emptyset$
 2: set the marginal gain $s_v = 0$ for all $v \in V$
 3: for $i = 1$ to $R$ do
 4:   generate $G'_i$
 5:   compute and record $R(G'_i, v)$ and $U(G'_i, v)$ for all $v \in V$
 6:   for each node $v \in V$ do
 7:     $s_v$ += $|R(G'_i, v)|$
 8:   end for
 9: end for
10: for $r = 1$ to $k$ do
11:   $v^* = \arg\max_{v \in V \setminus S} \{s_v\}$
12:   $S = S \cup \{v^*\}$
13:   for $i = 1$ to $R$ do
14:     for each node $w \in R(G'_i, v^*)$ do
15:       for each node $u \in U(G'_i, w)$ do
16:         delete $w$ from $R(G'_i, u)$
17:         $s_u = s_u - 1$
18:       end for
19:     end for
20:   end for
21: end for
22: output $S$
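Algorithm 2 can be sketched in Python along the following lines. This is a hedged illustration for directed graphs, not the authors' code: the edge-list format `(u, v, p)`, the function names and the `seed` parameter are assumptions, the reachable sets are computed by a plain search from every node (which is only practical for small graphs), and $k$ is assumed not to exceed the number of nodes.

```python
import random

def reach_set(adj, source):
    """Set of nodes reachable from source in one snapshot (includes source)."""
    seen, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def static_greedy_du(nodes, edges, k, R, seed=0):
    rng = random.Random(seed)
    gain = {v: 0 for v in nodes}
    reach_sets, source_sets = [], []   # per snapshot: R(G'_i, .) and U(G'_i, .)
    for _ in range(R):
        adj = {}
        for u, v, p in edges:
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        reach = {v: reach_set(adj, v) for v in nodes}
        sources = {v: set() for v in nodes}
        for u, targets in reach.items():
            for w in targets:
                sources[w].add(u)      # u can reach w, so u is in U(G'_i, w)
        for v in nodes:
            gain[v] += len(reach[v])   # initial marginal gain
        reach_sets.append(reach)
        source_sets.append(sources)
    S = []
    for _ in range(k):
        best = max((v for v in nodes if v not in S), key=lambda v: gain[v])
        S.append(best)
        for reach, sources in zip(reach_sets, source_sets):
            # Iterate over a copy: reach[best] itself shrinks during the update.
            for w in list(reach[best]):
                for u in sources[w]:
                    reach[u].remove(w)  # w is now covered by the seed set
                    gain[u] -= 1
    return S
```

The update never recomputes reachability: every node covered by the new seed is simply removed from the recorded reachable set of each node that could reach it, which keeps `gain` equal to the summed marginal gain over all snapshots.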



5. EXPERIMENT 

In this section, we conduct experiments on several real-world networks to compare our StaticGreedy algorithm with a number of existing algorithms. The experiments aim at illustrating the performance of our algorithm compared with other algorithms in the following two aspects: (a) accuracy in finding the seed nodes that maximize the influence spread, and (b) scalability.

5.1 Experiment setup 

Datasets. Six real-world networks are employed to demonstrate the performance of our algorithms in comparison with other existing algorithms. These networks include three scientific collaboration networks and three online social networks. For the three scientific collaboration networks, namely NetHEPT, NetPHY, and DBLP^1, nodes are authors and edges represent the coauthor relationships among authors. All of them are undirected networks. NetHEPT is extracted from the "High Energy Physics - Theory" section of the e-print arXiv (http://www.arXiv.org) between 1991 and 2003. NetPHY is constructed from the full paper list of the "Physics" section. DBLP, much larger than the former two scientific collaboration networks, is extracted from the DBLP Computer Science Bibliography Database maintained by Michael Ley. The three online social networks, namely Epinions, Slashdot, and Douban^2, are collected from the websites Epinions.com, Slashdot.com, and Douban.com. In Epinions, an edge $(u, v)$ means that $u$ trusts $v$. Slashdot is a friend network extracted from the technology-related news website Slashdot.com. The last dataset, Douban [5], is collected from douban.com, where users can rate books and movies and follow other users. The edges in this network represent follow relationships among users. All three online social networks are directed networks. We choose these six networks since they cover a variety of networks with sizes ranging from tens of thousands of edges to millions of edges. Some basic statistics about these networks are given in Table 2.

Table 2: Statistics of six test real world networks.

Datasets   #Nodes  #Edges  Directed?
NetHEPT    15K     59K     undirected
NetPHY     37K     231K    undirected
DBLP       655K    2M      undirected
Epinions   76K     509K    directed
Slashdot   77K     905K    directed
Douban     552K    22M     directed

Cascade Models. The adopted cascade models are again the two commonly used IC models, the UIC model and the WIC model, which we have used in Section 3.4.1. For the UIC model, we set the propagation probability $p = 0.001$ for Douban and $p = 0.01$ for the other networks, because the average degree of Douban is nearly ten times that of the others. For the WIC model, $p(u, v) = 1/k_v$, where $k_v$ is the indegree of node $v$.
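The two probability assignments can be written down in a few lines of Python. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the input is assumed to be a list of distinct directed edges `(u, v)`, so parallel edges of a multigraph would need separate handling.

```python
from collections import Counter

def uic_probabilities(edges, p=0.01):
    """Uniform IC model: every directed edge gets the same probability p."""
    return {(u, v): p for u, v in edges}

def wic_probabilities(edges):
    """Weighted IC model: p(u, v) = 1 / k_v, with k_v the indegree of v."""
    indegree = Counter(v for _, v in edges)
    return {(u, v): 1.0 / indegree[v] for u, v in edges}

# Small assumed example: node c has indegree 2, node b has indegree 1.
example_edges = [("a", "c"), ("b", "c"), ("a", "b")]
wic = wic_probabilities(example_edges)
```

A consequence of the WIC setting is that the probabilities on the incoming edges of every node sum to 1, whereas under UIC a high-degree node receives proportionally more total incoming probability.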

Algorithms. We compare our StaticGreedy algorithm and
its improved versions with both CELFGreedy and several
heuristic algorithms, including PMIA, DegreeDiscount and
Degree. For StaticGreedy, we set R = 100 by default; in
other words, 100 snapshots are employed for all the datasets.
StaticGreedy may therefore achieve better accuracy if
we increase the number of snapshots. For CELFGreedy,
R is set to 20,000. The PMIA algorithm has a
tunable parameter θ. We set the value of θ as done in
Ref. [8]. The DegreeDiscount heuristic [11] is developed for
the UIC model with the propagation probability p = 0.001

^The three networks are downloaded from
http://research.microsoft.com/en-us/people/weic/. These
scientific collaboration networks are actually multigraphs,
with parallel edges between two nodes denoting the number of
papers coauthored by the two authors.

^The former two networks can be downloaded from
http://snap.stanford.edu/data/. The last one can be
obtained on request via email to the authors.




Figure 4: Influence spread under UIC model on six datasets. 



for Douban and p = 0.01 for other cases. The Degree heuristic
simply selects seed nodes according to their degree.
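
As a rough illustration of the Degree baseline, the sketch below ranks nodes by degree and returns the top k. It assumes that "degree" means out-degree on a directed edge list and that node ids are integers (used only to break ties deterministically). Both are assumptions of this sketch, and `degree_heuristic` is a hypothetical name.

```python
from collections import defaultdict


def degree_heuristic(edges, k):
    """Return the k nodes with the largest out-degree (ties: smaller id first)."""
    out_degree = defaultdict(int)
    for u, v in edges:
        out_degree[u] += 1
        out_degree[v] += 0  # make sure nodes with no outgoing edge are counted
    ranked = sorted(out_degree, key=lambda n: (-out_degree[n], n))
    return ranked[:k]


# Toy graph (illustrative only): out-degrees are 1 -> 3, 2 -> 2, 3 -> 1, 4 -> 0.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
top_two = degree_heuristic(toy_edges, 2)
```

The heuristic never looks at overlap between the neighborhoods of the chosen seeds, which is one reason its influence spread can lag behind greedy methods.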

Since the PMIA algorithm is the state-of-the-art heuristic [8],
we do not consider other heuristics, including distance
centrality, betweenness centrality, and the PageRank heuristic.

5.2 Experimental results 

We run tests on the six datasets under two IC models, i.e.,
the UIC model and the WIC model. The tested seed set sizes k
are 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50. For the
comparison of running time, we only consider the seed set
size k = 50. All experiments are conducted on a server with a
2.0GHz Quad-Core Intel Xeon X7550 and 64 GB of memory.

5.2.1 Accuracy comparison 

We first test the accuracy of the StaticGreedy algorithms by
showing the influence spread of the obtained seed sets.
For every obtained seed set, 20,000 Monte Carlo simulations
are used to evaluate its influence spread. Figure 4
shows the experimental results on influence spread for the
six datasets under the UIC model. As shown in Figure 4(a) and
Figure 4(b), the CELFGreedy algorithm provides the best
influence spread on the moderate-sized networks NetHEPT
and NetPHY, where the CELFGreedy algorithm is still feasible
to run. On the NetHEPT dataset, all the algorithms
except the Degree heuristic have an influence spread
similar to that of the CELFGreedy algorithm. However, on
the NetPHY dataset, the differences among these algorithms
become visible. The StaticGreedy algorithm is still very close
to the CELFGreedy algorithm and outperforms all the other
algorithms. In fact, the difference between the StaticGreedy
algorithm and the CELFGreedy algorithm is less than 2%. For
the remaining large-scale networks, where the CELFGreedy
algorithm is infeasible due to its high computational cost, we
compare the StaticGreedy algorithm with the other three
baseline algorithms. We can see that the StaticGreedy algorithm
always has the best accuracy among these algorithms.
In particular, for the DBLP and Douban datasets, the
StaticGreedy algorithm significantly outperforms the competing
algorithms.
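
The evaluation step described above, in which the spread of a fixed seed set is averaged over 20,000 Monte Carlo simulations, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the graph is a dict mapping each directed edge (u, v) to its propagation probability, and the names `simulate_ic` and `estimate_spread` are hypothetical.

```python
import random
from collections import defaultdict


def simulate_ic(out_edges, seeds, rng):
    """Run one independent cascade and return the number of activated nodes."""
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for u in frontier:
            # Each newly activated node gets one chance per outgoing edge.
            for v, p in out_edges.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(edge_probs, seeds, runs=20000, seed=0):
    """Average cascade size over `runs` independent simulations."""
    out_edges = defaultdict(list)
    for (u, v), p in edge_probs.items():
        out_edges[u].append((v, p))
    rng = random.Random(seed)
    total = sum(simulate_ic(out_edges, seeds, rng) for _ in range(runs))
    return total / runs
```

Because the same evaluation procedure is applied to the seed sets of all algorithms, differences in the reported spread reflect the seed sets rather than the estimator.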

We further test the StaticGreedy algorithm on the six test
datasets with respect to the WIC model. For the moderate-sized
networks NetHEPT and NetPHY, where CELFGreedy
is still feasible to run, as shown in Figure 5(a) and
Figure 5(b), the StaticGreedy algorithm has almost the same
influence spread as the CELFGreedy algorithm, which is the
most accurate greedy algorithm. Moreover, the StaticGreedy
algorithm outperforms the other algorithms by a visible gap.
For the large-scale DBLP, Epinions, Slashdot and Douban
networks, the StaticGreedy algorithm has accuracy consistent
with the other three baseline algorithms, while the
CELFGreedy algorithm does not scale to these networks.

As demonstrated by the results on the six test networks
with both the UIC and WIC models, the StaticGreedy algorithm
has the same guaranteed accuracy as the original greedy
algorithm and outperforms the state-of-the-art heuristic
algorithms. More importantly, compared with the original greedy
algorithm, the guaranteed accuracy of our static greedy
algorithm is obtained with the number of Monte Carlo
simulations dramatically reduced by two orders of magnitude.
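
To make the reuse of Monte Carlo simulations concrete, the following is a minimal, unoptimized sketch of a static greedy selection. It is not the authors' implementation and assumes several things: edges are given as a dict from (u, v) to propagation probability, the R snapshots are sampled once up front, marginal gains are recomputed by plain graph search (no CELF and no dynamic update), and k does not exceed the number of nodes. All function names are hypothetical.

```python
import random


def sample_snapshots(edge_probs, R, seed=0):
    """Sample R live-edge snapshots once; they are reused in every iteration."""
    rng = random.Random(seed)
    snapshots = []
    for _ in range(R):
        adj = {}
        for (u, v), p in edge_probs.items():
            if rng.random() < p:
                adj.setdefault(u, []).append(v)
        snapshots.append(adj)
    return snapshots


def reachable(adj, source, covered):
    """Nodes newly reached from `source` in one snapshot, skipping covered ones."""
    if source in covered:
        return set()
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen and v not in covered:
                seen.add(v)
                stack.append(v)
    return seen


def static_greedy(nodes, edge_probs, k, R=100, seed=0):
    """Pick k seeds by average marginal gain over the same R snapshots."""
    snapshots = sample_snapshots(edge_probs, R, seed)
    covered = [set() for _ in snapshots]  # nodes already reached, per snapshot
    seeds = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            gain = sum(len(reachable(adj, v, cov))
                       for adj, cov in zip(snapshots, covered)) / R
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
        for adj, cov in zip(snapshots, covered):
            cov |= reachable(adj, best, cov)
    return seeds
```

Since every iteration evaluates marginal gains on the same fixed snapshots, the estimated spread is a coverage function and is therefore submodular by construction. This is the property the static approach relies on to use R = 100 instead of 20,000 simulations.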

5.2.2 Running time comparison 

We now test the running time of the StaticGreedy algorithm
and the competing algorithms. With respect to StaticGreedy,
we test the running time of both the StaticGreedyDU algorithm
and the StaticGreedy algorithm with CELF optimization,
denoted as StaticGreedyCELF. Figure 6 shows the
experimental results. For the six networks, StaticGreedyDU
is always 2-7 times faster than StaticGreedyCELF, and this
speed difference is even more significant for DBLP. For the
moderate-sized datasets, i.e., NetHEPT and NetPHY, the
Figure 5: Influence spread under WIC model on six datasets.



CELFGreedy algorithm is already quite slow. The CELFGreedy
algorithm requires several hours while our static
greedy algorithms take only several seconds. StaticGreedyDU
and StaticGreedyCELF both reduce the running time by
three orders of magnitude compared with the CELFGreedy
algorithm. More importantly, this reduction in running time
is obtained without affecting
the guaranteed accuracy. The time cost of our two static
greedy algorithms is comparable to that of the PMIA algorithm,
which is the most scalable heuristic algorithm. Note that
the accuracy of the PMIA algorithm is not guaranteed in
theory. Moreover, the StaticGreedyDU algorithm even outperforms
the PMIA algorithm on three large-scale networks: Epinions,
Slashdot and Douban. This suggests that our algorithm has a
potential advantage over the PMIA algorithm on large-scale
networks. The Degree and DegreeDiscount
algorithms are more scalable than our method.
However, this benefit comes at the cost of a lower
influence spread.

6. CONCLUSION AND FUTURE WORK 

In this paper, we have analyzed the scalability-accuracy
dilemma of the greedy algorithm, which has its roots in the
unguaranteed submodularity property of existing
implementations. To combat this problem, we propose a static greedy
algorithm that removes the independence among Monte Carlo
simulations in different iterations. The submodularity
property of influence spread is strictly guaranteed in the static
greedy algorithm. The proposed algorithm achieves the same
accuracy as the state-of-the-art greedy algorithms while
the number of Monte Carlo simulations needed is dramatically
reduced by two orders of magnitude. We further give a
dynamic update strategy to improve the static greedy
algorithm, with which our algorithm becomes comparable
to the most scalable heuristic algorithm.

For future work, we will study how to determine the
minimum number R of Monte Carlo simulations for a specific
network and a given spread model. We also want to investigate
whether the proposed strategy can be extended to the
linear threshold model. Finally, we will try to implement
the proposed static algorithm within a parallel
computing framework, which would further improve its
computational efficiency. We also look forward to seeing more
applications of our algorithm to real-world networks and
practical scenarios.